Planning the 2021 experiment

Analysis and calculations in service of planning the 2021 follow-up experiment.

Michael R. McLaren
2021-06-22

Exploring DNA yields in the 2019 experiment

Load the DNA sample data and compute the concentration of input material for each aliquot in terms of pellets per aliquot, accounting for different homogenization volumes. Here I approximate the homogenization volume by the amount of PBS added; this is likely an ok approximation for getting relative concentration differences within sample types, which is the main purpose here.

dna <- here("output", "sample-data", "dna-sample-data.Rds") %>%
  readRDS %>%
  mutate(
    pellets_per_aliquot = number_of_pellets / volume_pbs * volume_per_aliquot
  ) %>%
  glimpse
#> Rows: 100
#> Columns: 27
#> $ dna_sample_id       <chr> "1_1", "1_2", "1_3", "1_4", "2_1", "2_2"…
#> $ specimen_id         <fct> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4…
#> $ aliquot_number      <fct> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2…
#> $ specimen_type       <fct> T. mobilis, T. mobilis, T. mobilis, T. m…
#> $ collection_week     <fct> NA, NA, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ collection_day      <fct> NA, NA, NA, NA, 1, 1, 1, 1, 2, 2, 2, 2, …
#> $ collection_date     <date> NA, NA, NA, NA, 2019-09-09, 2019-09-09,…
#> $ host_species        <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ host_sex            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ extraction_batch    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ extraction_date     <fct> 2019-12-03, 2019-12-03, 2019-12-03, 2019…
#> $ number_of_pellets   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ number_of_aliquots  <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
#> $ volume_pbs          <dbl> 1000, 1000, 1000, 1000, 1000, 1000, 1000…
#> $ volume_per_aliquot  <dbl> 200, 200, 200, 200, 200, 200, 200, 200, …
#> $ extraction_protocol <fct> 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2…
#> $ dna_conc            <dbl> 0.279, 0.958, 0.357, 1.185, 0.283, 1.159…
#> $ dna_conc_s1         <dbl> 0.080, 0.720, NA, NA, 0.080, 0.761, 0.00…
#> $ dna_conc_s2         <dbl> 0.358, 1.550, NA, NA, 0.272, 3.200, 0.26…
#> $ well                <chr> "C4", "F9", "H11", "H12", "G7", "E5", "A…
#> $ row                 <ord> C, F, H, H, G, E, A, G, D, E, H, E, C, F…
#> $ column              <ord> 4, 9, 11, 12, 7, 5, 3, 3, 10, 12, 3, 10,…
#> $ A1                  <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TR…
#> $ A2                  <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
#> $ S1                  <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TR…
#> $ S2                  <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TR…
#> $ pellets_per_aliquot <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, …
dna %>%
  count(specimen_type, number_of_aliquots, number_of_pellets, 
    pellets_per_aliquot
  ) %>%
  kable()
specimen_type number_of_aliquots number_of_pellets pellets_per_aliquot n
Fecal 4 1 0.200 64
Inoculum 2 4 1.778 2
Inoculum 4 1 0.200 12
Inoculum 4 4 0.941 4
Inoculum 4 5 1.176 4
Inoculum 4 6 1.412 8
T. mobilis 2 10 4.000 2
T. mobilis 4 1 0.200 4

In the second extraction batch, Angie pooled pellets of the inoculum and T. mobilis, and also added less PBS, in order to increase the biomass.

Compare DNA concentration estimates

We have three different DNA concentration measurements for most samples: Our own Qubit measurements, the Picogreen measurements from center S1, and the Qubit measurements from center S2. Let’s take a quick look at how they compare.

dna %>%
  mutate(across(starts_with("dna_conc"), ~log10(. + 1e-2))) %>%
  ggplot(aes(x = .panel_x, y = .panel_y, 
      color = specimen_type, fill = specimen_type)) +
  geom_point(alpha = 0.5) + 
  ggforce::geom_autodensity(alpha = 0.5, na.rm = TRUE) +
  ggforce::facet_matrix(vars(starts_with("dna_conc")), layer.diag = 2) +
  theme_minimal_grid() +
  scale_color_brewer(type = "qual") +
  scale_fill_brewer(type = "qual")

What does this imply about where certain measurements might be over/under-estimating changes in yield?

Compare normalized DNA yields

Given the general agreement between the three measurement types, let’s do some exploration using our measurements (which we have for a larger set of samples). Let’s consider yield vs expected biomass as determined by the number of pellets per aliquot.

p0 <- dna %>%
  ggplot(aes(pellets_per_aliquot, dna_conc,
      color = specimen_type, shape = extraction_batch)) +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_brewer(type = "qual", palette = 2) +
  scale_shape_manual(values = c(1, 3)) +
  theme_minimal_grid() +
  coord_fixed() +
  facet_wrap(~extraction_protocol, labeller = "label_both")
p1 <- p0 + 
  geom_quasirandom(groupOnX = TRUE, width = 0.2)
p2 <- p0 + 
  geom_text(aes(label = specimen_id), 
    position = position_quasirandom(groupOnX = TRUE, width = 0.2))
p1 / p2

The bottom panel is the same as the top but with points labeled by specimen id.

Some discussion taken from Section “Differences in DNA yields” in the planning doc:

Much greater in fecal than inoculum samples; Greater yield from protocol 1 on fecal; from protocol 2 on inoculum + T. mobilis; Angie commented that it might make sense we got more DNA from the FastDNA on fecal since that protocol was meant for stool. A possibly relevant note from the experiment is that for the DNeasy protocol, the fibrous fecal particles got stuck on the filter of the spin column; could that reduce yield in fecal samples? Batch 2 inoculum samples have larger yields due to pooling of multiple pellets.

Some other observations:

For Protocol 2, the relationship between pellets per aliquot (PPA) and yield looks roughly proportional across the two batches. For Protocol 1, it looks somewhat less than proportional for the inoculum samples, but greater than proportional for the high PPA T.mob sample. Possibly of note is that the second batch went through a further freeze-thaw cycle (all sample types) and suspension and pelleting (inoculum and T. mobilis types) (CHECK); this might be expected to increase yields though this is unclear.

Might be worth looking at the other DNA concentration measurements, which might give some evidence as to whether our concentration measurements themselves are linear at the low end.

Compute biomass increase needed to obtain fecal-like yield

y <- dna %>%
  pivot_longer(starts_with("dna_conc")) %>%
  filter(!is.na(value)) %>%
  mutate(
    value = value / pellets_per_aliquot
  ) %>%
  with_groups(
    c(specimen_type, extraction_protocol, extraction_batch, name),
    # geometric median
    summarize, across(value, ~exp(median(log(.))))
  ) %>%
  pivot_wider(names_from = specimen_type) %>%
  mutate(across(c(Inoculum, `T. mobilis`), ~ Fecal / .)) %>%
  pivot_longer(c(Inoculum, `T. mobilis`), names_to = "specimen_type")
y %>%
  ggplot(aes(extraction_batch, value, color = specimen_type)) +
  geom_quasirandom() +
  scale_y_log10(breaks = 10^seq(-2, 4)) +
  facet_grid(name~extraction_protocol, labeller = "label_both") +
  expand_limits(y = 1) +
  theme_minimal_hgrid() +
  scale_color_brewer(type = "qual", palette = 1) +
  theme(
    legend.position = "bottom",
    panel.spacing.y = unit(10, "mm")
  )

Can see that extraction protocol 2 consistent results indicate a multiplier of around 10X is needed, while extraction protocol 1 indicates a factor of 100X or more is needed, depending on which concentration measurements we use. Splitting the difference and aiming for 30X increase in input biomass per aliquot is a reasonable compromise, but shooting for 100X is a safer bet if we want to ensure that the dilution window overlaps the fecal composition in the original experiment.

Choosing mixing volumes for the mock communities

Goal is to choose mock compositions (in terms of volume of pure culture) for the 2021 experiment using the abundance profiles from the 2019 experiment.

The aim is to have three mock communities with the following properties:

I will use the measurements from all four sequencing centers to help determine mixing volumes for mocks M2 and M3.

ps <- here("output/phyloseq/2020-11-28-phyloseq.Rds") %>% 
  readRDS %>%
  mutate_sample_data(., sample_sum = sample_sums(.))
tree <- here("output/strain-data/gtdb-representatives.tree") %>%
  ape::read.tree()

Let’s filter to the inoculum strains and E. coli Nissle, and aggregate to species level.

Note that for the amplicon ASVs that match E. coli HS and E. coli Nissle 1917, we have multiple entries in organism and strain_group. This is important to keep in mind when working with the amplicon data and doing taxonomic filtering or merging.

tax <- tax_table(ps) %>% as_tibble
tax.ec <- tax %>% filter(str_detect(organism, "Esch"))
tax.ec %>% select(.otu, species:source)
#> # A tibble: 6 x 5
#>   .otu     species    organism                strain_group      source
#>   <chr>    <chr>      <chr>                   <chr>             <chr> 
#> 1 v2_A1_A… Escherich… Escherichia coli HS; E… Inoculum; Mouse … A1    
#> 2 v2_A1_A… Escherich… Escherichia coli Nissl… Mouse contaminant A1    
#> 3 v2_A2_A… Escherich… Escherichia coli HS; E… Inoculum; Mouse … A2    
#> 4 v2_A2_A… Escherich… Escherichia coli Nissl… Mouse contaminant A2    
#> 5 GCF_000… Escherich… Escherichia coli HS     Inoculum          RefSeq
#> 6 GCF_003… Escherich… Escherichia coli Nissl… Mouse contaminant RefSeq
tax0 <- tax %>%
  filter(
    strain_group == "Inoculum" | species == "Escherichia coli",
    str_detect(organism, "Esch")
  )
all.equal(tax.ec, tax0)
#> [1] TRUE
rm(tax.ec, tax0)

I also filter to the primary sample types and removal of the low-count samples from A2,

ps1 <- ps %>%
  filter_sample_data(
    specimen_type %in% c("Fecal", "Inoculum"),
    sample_sum >= 1e4 
  ) %>%
  filter_tax_table(
    strain_group == "Inoculum" | species == "Escherichia coli",
  ) %>%
  tax_glom("species") %>%
  mutate_tax_table(.otu = species)
ps1 %>% tax_table %>% as_tibble %>% select(phylum, species) %>%
  arrange(phylum) %>% kable
phylum species
Actinobacteriota Collinsella aerofaciens
Bacteroidota Bacteroides ovatus
Bacteroidota Bacteroides uniformis
Bacteroidota Bacteroides thetaiotaomicron
Bacteroidota Bacteroides caccae
Bacteroidota Barnesiella intestinihominis
Firmicutes Clostridium symbiosum
Firmicutes Roseburia intestinalis
Firmicutes Faecalibacterium prausnitzii
Firmicutes Marvinbryantia formatexigens
Firmicutes Eubacterium rectale
Proteobacteria Escherichia coli
Verrucomicrobiota Akkermansia muciniphila

I will work with the abundances as compositional vectors, with a pseudo-proportion of 1e-3 added to account for detection limits and to moderate the extreme log-fold differences that can arise due to some organisms not being present or detectable in all samples.

ps1.1 <- ps1 %>%
  transform_sample_counts(close_elts) %>%
  transform_sample_counts(~center_elts(. + 3e-3))

Let’s take a look at all the data,

p <- ps1.1 %>% as_tibble %>%
  ggplot(aes(y = abbreviate(.otu), x = .abundance, color = specimen_type, 
      shape = extraction_protocol:center_id)
  ) +
  facet_wrap(~extraction_batch,
    labeller = label_both) +
  scale_x_log10() +
  scale_color_brewer(type = "qual") +
  scale_shape_manual(values = c(1, 16, 2, 17, 5, 18, 3, 8)) +
  theme_minimal_hgrid() +
  theme(axis.title.y = element_blank())
p +
  geom_quasirandom(groupOnX = FALSE)

Note, I think the shotgun bioinformatics protocol is expected to be biased against E. coli, and we see that here for the inoculum samples but not in the fecal samples.

ps1.1 %>% 
  filter_tax_table(species == "Escherichia coli") %>%
  as_tibble %>%
  ggplot(aes(y = .otu, x = .abundance, color = specimen_type, 
      shape = extraction_protocol)
  ) +
  facet_grid(center_id~extraction_batch, labeller = label_both) +
  scale_x_log10() +
  scale_color_brewer(type = "qual") +
  scale_shape_manual(values = c(1, 16)) +
  theme_minimal_hgrid() +
  theme(
    axis.title.y = element_blank(),
    axis.text.y = element_blank()
  ) +
  geom_quasirandom(groupOnX = FALSE) +
  plot_annotation(
    title = "E. coli abundance relative to geometric mean"
  )

Let’s try to compute the compositional difference between fecal and inoculum for each batch, protocol, center combination.

#> Order by the phylogeny,
lvls <- tax %>%
  filter(.otu %in% tree$tip.label) %>%
  mutate(across(.otu, factor, levels = tree$tip.label)) %>%
  arrange(.otu) %>%
  pull(species) %>%
  unique

x <- ps1.1 %>%
  as_tibble %>%
  group_by(.otu, specimen_type, extraction_protocol, center_id,
    extraction_batch, specimen_id
  ) %>%
  summarise(across(.abundance, gm_mean), .groups = "drop_last") %>%
  summarise(across(.abundance, gm_mean), .groups = "drop") %>%
  mutate(across(.otu, factor, levels = lvls)) %>%
  arrange(.otu)
y <- x %>% 
  pivot_wider(names_from = specimen_type, values_from = .abundance) %>%
  mutate(diff = Fecal / Inoculum)
y %>%
  mutate(across(.otu, fct_reorder, diff)) %>%
  ggplot(aes(y = abbreviate(.otu), x = diff, 
      color = center_id:extraction_protocol)
  ) +
  facet_grid(~extraction_batch, labeller = label_both) +
  scale_x_log10() +
  scale_color_brewer(type = "qual", palette = 2) +
  theme_minimal_hgrid() +
  geom_quasirandom(groupOnX = F, width = 0.2) +
  theme(axis.title.y = element_blank())
  # geom_vline(xintercept = c(0.1, 1, 10))

This graph indicates that the compositional difference between fecal and inoculum samples is fairly consistent across extraction and sequencing protocol, which is a nice confirmation of our expectation from the MWC model. There are, however, some interesting modest exceptions. For example, Ec and Clss are reduced in S{1,2}:E1 protocols.

Something notable is that Clss is reduced in the second experiment relative to the first. Angie reported a much lower OD for Clss in the second inoculum batch, so this makes some sense.

Mock volumes

Let’s use the geometric average to choose the mixing ratios,

m2 <- y %>%
  with_groups(.otu, summarize, 
    across(c(Fecal, Inoculum), gm_mean)) %>%
  mutate(
    diff = Fecal / Inoculum,
    diff_sqrt = sqrt(diff)
  ) %>%
  arrange(diff)
m2 %>% kable
.otu Fecal Inoculum diff diff_sqrt
Faecalibacterium prausnitzii 0.115 2.430 0.047 0.218
Collinsella aerofaciens 0.167 1.644 0.102 0.319
Roseburia intestinalis 0.598 3.582 0.167 0.409
Escherichia coli 2.164 7.980 0.271 0.521
Marvinbryantia formatexigens 0.401 0.790 0.508 0.713
Eubacterium rectale 0.149 0.169 0.881 0.938
Clostridium symbiosum 1.295 0.774 1.674 1.294
Akkermansia muciniphila 2.444 1.173 2.084 1.444
Bacteroides caccae 1.833 0.821 2.233 1.494
Barnesiella intestinihominis 0.288 0.094 3.075 1.754
Bacteroides uniformis 5.147 1.600 3.217 1.793
Bacteroides thetaiotaomicron 5.063 1.229 4.121 2.030
Bacteroides ovatus 15.427 0.479 32.214 5.676

Simple option: make the ratios 1 for the first 6, for the next 6, and 4 for the final (Bcto).

m2 <- m2 %>%
  mutate(
    rank = rank(diff),
    ratio = case_when(
      rank <= 6 ~ 1,
      between(rank, 7, 12) ~ 2,
      rank == 13 ~ 4,
    )
  )

For the third mock, aim to make the percentage of 16S reads (more) even. Similar to Mock 2, I will use just three rounded ratio values for simplicity of pipetting.

m3 <- y %>%
  filter(center_id %in% c("A1", "A2")) %>%
  with_groups(.otu, summarize, 
    across(c(Fecal, Inoculum), gm_mean)) %>%
  mutate(
    target = 1 / Inoculum,
    target_sqrt = sqrt(target)
  ) %>%
  arrange(target)
m3 %>% kable
.otu Fecal Inoculum target target_sqrt
Escherichia coli 3.590 7.265 0.138 0.371
Roseburia intestinalis 0.724 3.398 0.294 0.542
Faecalibacterium prausnitzii 0.101 2.884 0.347 0.589
Akkermansia muciniphila 4.414 1.747 0.572 0.756
Bacteroides uniformis 4.038 1.600 0.625 0.791
Collinsella aerofaciens 0.180 1.519 0.658 0.811
Bacteroides thetaiotaomicron 3.479 1.228 0.814 0.902
Bacteroides caccae 1.538 0.975 1.025 1.013
Clostridium symbiosum 1.470 0.643 1.555 1.247
Marvinbryantia formatexigens 0.261 0.560 1.786 1.336
Bacteroides ovatus 10.894 0.539 1.857 1.363
Eubacterium rectale 0.153 0.162 6.165 2.483
Barnesiella intestinihominis 0.348 0.088 11.394 3.376
m3 <- m3 %>%
  mutate(
    rank = rank(target),
    ratio = case_when(
      rank <= 1 ~ 1,
      between(rank, 2, 11) ~ 1.5,
      rank >= 12 ~ 3,
    )
  )
m3 %>% kable
.otu Fecal Inoculum target target_sqrt rank ratio
Escherichia coli 3.590 7.265 0.138 0.371 1 1.0
Roseburia intestinalis 0.724 3.398 0.294 0.542 2 1.5
Faecalibacterium prausnitzii 0.101 2.884 0.347 0.589 3 1.5
Akkermansia muciniphila 4.414 1.747 0.572 0.756 4 1.5
Bacteroides uniformis 4.038 1.600 0.625 0.791 5 1.5
Collinsella aerofaciens 0.180 1.519 0.658 0.811 6 1.5
Bacteroides thetaiotaomicron 3.479 1.228 0.814 0.902 7 1.5
Bacteroides caccae 1.538 0.975 1.025 1.013 8 1.5
Clostridium symbiosum 1.470 0.643 1.555 1.247 9 1.5
Marvinbryantia formatexigens 0.261 0.560 1.786 1.336 10 1.5
Bacteroides ovatus 10.894 0.539 1.857 1.363 11 1.5
Eubacterium rectale 0.153 0.162 6.165 2.483 12 3.0
Barnesiella intestinihominis 0.348 0.088 11.394 3.376 13 3.0

Create a table with the ratios for all three mocks, and check the total volumes

m <- bind_rows(
  M1 = m2 %>% transmute(.otu, ratio = 1),
  M2 = m2 %>% select(.otu, ratio),
  M3 = m3 %>% select(.otu, ratio),
  .id = "mock"
) %>%
  arrange(.otu)
m %>% pivot_wider(names_from = mock, values_from = ratio) %>% kable
.otu M1 M2 M3
Escherichia coli 1 1 1.0
Bacteroides uniformis 1 2 1.5
Bacteroides ovatus 1 4 1.5
Bacteroides caccae 1 2 1.5
Bacteroides thetaiotaomicron 1 2 1.5
Barnesiella intestinihominis 1 2 3.0
Akkermansia muciniphila 1 2 1.5
Faecalibacterium prausnitzii 1 1 1.5
Marvinbryantia formatexigens 1 1 1.5
Eubacterium rectale 1 1 3.0
Roseburia intestinalis 1 1 1.5
Clostridium symbiosum 1 2 1.5
Collinsella aerofaciens 1 1 1.5
m %>% with_groups(mock, summarize, across(ratio, sum)) %>% kable
mock ratio
M1 13
M2 22
M3 22

Let’s adjust the final volumes so that the mocks all sum to ~66 mL while still having round numbers,

m_vol <- m %>% 
  mutate(
    mult = case_when(
      mock == "M1" ~ 5,
      mock != "M1" ~ 3
    ),
    # across(mult, ~ . * 2), # double for two bottle option
    volume = ratio * mult,
  )
m_vol %>% with_groups(mock, summarize, across(volume, sum)) %>% kable
mock volume
M1 65
M2 66
M3 66
m_vol %>% with_groups(.otu, summarize, across(volume, sum)) %>% kable
.otu volume
Escherichia coli 11.0
Bacteroides uniformis 15.5
Bacteroides ovatus 21.5
Bacteroides caccae 15.5
Bacteroides thetaiotaomicron 15.5
Barnesiella intestinihominis 20.0
Akkermansia muciniphila 15.5
Faecalibacterium prausnitzii 12.5
Marvinbryantia formatexigens 12.5
Eubacterium rectale 17.0
Roseburia intestinalis 12.5
Clostridium symbiosum 15.5
Collinsella aerofaciens 12.5
m_vol_wide <- m_vol %>% 
  select(.otu, mock, volume) %>%
  pivot_wider(names_from = mock, values_from = volume)
m_vol_wide %>% kable
.otu M1 M2 M3
Escherichia coli 5 3 3.0
Bacteroides uniformis 5 6 4.5
Bacteroides ovatus 5 12 4.5
Bacteroides caccae 5 6 4.5
Bacteroides thetaiotaomicron 5 6 4.5
Barnesiella intestinihominis 5 6 9.0
Akkermansia muciniphila 5 6 4.5
Faecalibacterium prausnitzii 5 3 4.5
Marvinbryantia formatexigens 5 3 4.5
Eubacterium rectale 5 3 9.0
Roseburia intestinalis 5 3 4.5
Clostridium symbiosum 5 6 4.5
Collinsella aerofaciens 5 3 4.5
ss <- "https://docs.google.com/spreadsheets/d/1mqrCoBj7y2lajEY3VyiPOy98S9HAY-YRKyQLEKI4LOs"
m_vol_wide %>%
  googlesheets4::sheet_write(ss, sheet = "[R export] Mixing volumes")

Inspecting impact of sample splitting

In the fecal samples, aliquots 1 and 3 were extrated by protocol 1, and aliquots 2 and 4 by protocol 2. Our analysis assumes that all four aliquots are equivelent prior to DNA extraction. Is there any evidence to the contrary, in which case we might want to randomize the protocol treatment across aliquots in the new experiment?

Effect of aliquot number on DNA yield in fecal samples

Did our homogenization and aliquot-assignment procedure led to any systematic differences in the DNA yield between the first and second aliquot for a given specimen-protocol combination?

z <- dna %>%
  filter(specimen_type == "Fecal") %>%
  with_groups(c(specimen_id, extraction_protocol), mutate,
    aliquot_rank = rank(aliquot_number) %>% factor)
z %>%
  ggplot(aes(y = aliquot_rank, x = dna_conc)) +
  facet_grid(extraction_batch ~ extraction_protocol) +
  geom_quasirandom(groupOnX = FALSE) +
  scale_x_log10()

Let’s plot the concentration from the second aliquot against the first,

z %>%
  select(specimen_id, starts_with("extraction"), dna_conc, aliquot_rank) %>%
  mutate(across(aliquot_rank, fct_recode, first = "1", second = "2")) %>%
  pivot_wider(names_from = aliquot_rank, values_from = dna_conc) %>%
  ggplot(aes(y = second, x = first, 
      color = extraction_protocol:extraction_batch)) +
  scale_x_log10() + scale_y_log10() + coord_fixed() +
  scale_color_brewer(type = "qual", palette = "Paired") +
  theme(
    legend.position = "bottom"
  ) +
  geom_abline(color = "grey") +
  geom_point()

There appears to be a 1:1 relationship between yields from the first to second aliquot of a given specimen-protocol set. This result suggests that the homogenization procedure successfully homogenized aliquots 1 and 3, and aliquots 2 and 4.

Effect of aliquot number on composition in fecal samples

I looked at this question in the pilot data (A1 data, v1 pipeline) in analysis/2020-02-29-homogenization.html, and will take another look here using all the data.

One simple thing we can try is to compare the CLR values between the first and second aliquots of a given specimen-protocol combination, as with DNA yield above.

x <- ps1.1 %>%
  as_tibble %>%
  mutate(
    aliquot_rank = case_when(
      aliquot_number %in% 1:2 ~ "First",
      aliquot_number %in% 3:4 ~ "Second",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(aliquot_rank)) %>%
  select(.otu, .abundance, center_id, specimen_id, starts_with("extraction"),
    aliquot_rank) %>%
  pivot_wider(names_from = aliquot_rank, values_from = .abundance)
x %>%
  filter(.otu == "Bacteroides ovatus") %>%
  count(is.na(First), is.na(Second))
#> # A tibble: 3 x 3
#>   `is.na(First)` `is.na(Second)`     n
#>   <lgl>          <lgl>           <int>
#> 1 FALSE          FALSE             148
#> 2 FALSE          TRUE               33
#> 3 TRUE           FALSE               6

A number of cases do not have both First and Second values; perhaps this is due to the samples that were filtered from low read counts or left out of the centers that received fewer than 96 samples?

x %>%
  filter(.otu == "Bacteroides ovatus") %>%
  count(center_id, is.na(First), is.na(Second))
#> # A tibble: 9 x 4
#>   center_id `is.na(First)` `is.na(Second)`     n
#>   <fct>     <lgl>          <lgl>           <int>
#> 1 A1        FALSE          FALSE              42
#> 2 A1        FALSE          TRUE                6
#> 3 A2        FALSE          FALSE              28
#> 4 A2        FALSE          TRUE               10
#> 5 A2        TRUE           FALSE               6
#> 6 S1        FALSE          FALSE              36
#> 7 S1        FALSE          TRUE               11
#> 8 S2        FALSE          FALSE              42
#> 9 S2        FALSE          TRUE                6
Click for figure
x %>%
  ggplot(aes(First, Second, color = extraction_protocol:extraction_batch)) +
  facet_grid(abbreviate(.otu) ~ center_id) +
  scale_x_log10() + scale_y_log10() + coord_fixed() +
  scale_color_brewer(type = "qual", palette = "Paired") +
  theme(
    legend.position = "bottom"
  ) +
  geom_abline(color = "grey") +
  geom_point()

We see a strong 1:1 relationship across taxa, suggesting that our procedure led to replicate aliquots that behaved approximately the same in the experiment.

Generate sample data and plate layout

First, get a table with the basic info for each specimen

mocks <- tibble(mixture = str_c("M", 1:3)) %>%
  crossing(feces_added = c(TRUE, FALSE)) %>%
  mutate(
    specimen_base_id = str_c(mixture, ifelse(feces_added, "F", "N")),
    specimen_type = ifelse(feces_added, "Mock in feces", "Mock only")
  )
gnot <- tibble(
  specimen_base_id = str_c("F", 1:3),
  specimen_type = "Fecal"
)
tmob <- tibble(
  specimen_base_id = "T1",
  specimen_type = "T. mobilis",
  dilution_power = 0L,
  feces_added = FALSE
)
all_specimens <- bind_rows(mocks, gnot) %>%
  crossing(
    dilution_power = 0:2
  ) %>%
  bind_rows(tmob) %>%
  mutate(
    dilution_label = str_c("D", dilution_power),
    dilution_factor = 10^dilution_power,
    specimen_name = str_c(sep = "-",
      specimen_base_id,
      dilution_label
    )
  ) %>%
  relocate(specimen_name) %>%
  glimpse
#> Rows: 28
#> Columns: 8
#> $ specimen_name    <chr> "M1N-D0", "M1N-D1", "M1N-D2", "M1F-D0", "M1…
#> $ mixture          <chr> "M1", "M1", "M1", "M1", "M1", "M1", "M2", "…
#> $ feces_added      <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALS…
#> $ specimen_base_id <chr> "M1N", "M1N", "M1N", "M1F", "M1F", "M1F", "…
#> $ specimen_type    <chr> "Mock only", "Mock only", "Mock only", "Moc…
#> $ dilution_power   <int> 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2…
#> $ dilution_label   <chr> "D0", "D1", "D2", "D0", "D1", "D2", "D0", "…
#> $ dilution_factor  <dbl> 1, 10, 100, 1, 10, 100, 1, 10, 100, 1, 10, …

The DNA samples correspond to the four aliquots obtained from each specimen (with the last two aliquots not used for specimens derived from M3 and F3),

dna_samples <- all_specimens %>%
  crossing(aliquot_number = 1:4) %>%
  # Remove aliquots 3 and 4 from specimens not getting replicates (M3, F3)
  filter(
    ! (specimen_base_id %in% c("M3N", "M3F", "F3") & aliquot_number %in% 3:4)
  ) %>%
  mutate(
    dna_sample_id = str_c(sep = "-", specimen_name, aliquot_number),
    extraction_protocol = ifelse(aliquot_number %in% c(1, 3), 1, 2) %>% factor
  ) %>%
  relocate(dna_sample_id)

Well assignment is simplified this time since we are using all 94 samples for all sequencing centers. Wells H11 and H12 are reserved for A1 controls and will be left blank in all plates. I will manually assign the T. mobilis samples, then randomly assign others. This time, I will put T. mobilis controls in distinct rows and columns, roughly evenly spaced across the plate. Rows: A, C, E, G; Cols: 1, 4, 7, 10. In A1, the Zymo control will be in H11 or H12; I’ll put the T. mobilis samples going from lower left (A1) to upper right (G10) to space away from the Zymo sample.

set.seed(4)

reserved_wells <- str_c("H", 11:12)
tmob_wells <- dna_samples %>%
  filter(specimen_type == "T. mobilis") %>%
  arrange(aliquot_number) %>%
  mutate(
    row = LETTERS[c(1, 3, 5, 7)],
    column = seq(1, 12, by = 3) %>% as.integer %>% rev,
    well = str_c(row, column)
  )
other_samples <- dna_samples %>% 
  filter(specimen_type != "T. mobilis") %>%
  pull(dna_sample_id)
other_wells <- crossing(row = LETTERS[1:8], column = 1:12) %>%
  mutate(well = str_c(row, column)) %>%
  filter(!(well %in% c(tmob_wells$well, reserved_wells))) %>%
  mutate(dna_sample_id = sample(other_samples)) %>%
  left_join(dna_samples, by = "dna_sample_id")
plate_layout <- bind_rows(other_wells, tmob_wells) %>%
  arrange(dna_sample_id) %>%
  mutate(
    across(row, ordered, levels = LETTERS[1:8]),
    across(column, ordered, levels = seq(12))
  )
plate_layout %>%
  mutate(across(row, fct_rev)) %>%
  ggplot(aes(y = row, x = column, fill = specimen_type)) +
  coord_fixed() +
  geom_tile() +
  geom_text(aes(label = specimen_base_id), size = 3) +
  scale_fill_brewer(type = "qual", palette = 2) + 
  labs(title = "Plate layout") +
  theme(
    legend.position = "bottom"
  )

Looks good!

Let’s save the DNA sample data with well assignments in a Google Sheet. I’ll also save the well assignments in the form of a 12 by 8 grid with the full DNA sample names, to refer to during the actual plating.

ss <- "https://docs.google.com/spreadsheets/d/10DNhnhE17lPOSIvzbhYxDTG--9KDMkM9Eum9Psb9B10"
plate_layout %>%
  googlesheets4::sheet_write(ss, sheet = "[R export] DNA sample names and wells")
plate_layout %>%
  select(dna_sample_id, row, column) %>%
  pivot_wider(names_from = column, values_from = dna_sample_id,
    names_sort = TRUE) %>% 
  arrange(row) %>%
  rename(`row//column` = row) %>%
  googlesheets4::sheet_write(ss, sheet = "[R export] Plate layout")

Session info

Click for session info
sessioninfo::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.1.0 (2021-05-18)
#>  os       Arch Linux                  
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2021-06-29                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────────────
#>  package          * version    date       lib source                           
#>  ade4               1.7-16     2020-10-28 [1] CRAN (R 4.0.3)                   
#>  ape                5.5        2021-04-25 [1] CRAN (R 4.1.0)                   
#>  askpass            1.1        2019-01-13 [1] CRAN (R 4.0.0)                   
#>  assertthat         0.2.1      2019-03-21 [1] CRAN (R 4.0.0)                   
#>  backports          1.2.1      2020-12-09 [1] CRAN (R 4.0.3)                   
#>  beeswarm           0.4.0      2021-06-01 [1] CRAN (R 4.1.0)                   
#>  Biobase            2.52.0     2021-05-19 [1] Bioconductor                     
#>  BiocGenerics       0.38.0     2021-05-19 [1] Bioconductor                     
#>  biomformat         1.20.0     2021-05-19 [1] Bioconductor                     
#>  Biostrings         2.60.0     2021-05-19 [1] Bioconductor                     
#>  bitops             1.0-7      2021-04-24 [1] CRAN (R 4.1.0)                   
#>  broom              0.7.6      2021-04-05 [1] CRAN (R 4.0.5)                   
#>  bslib              0.2.5.1    2021-05-18 [1] CRAN (R 4.1.0)                   
#>  cellranger         1.1.0      2016-07-27 [1] CRAN (R 4.0.0)                   
#>  cli                2.5.0      2021-04-26 [1] CRAN (R 4.1.0)                   
#>  cluster            2.1.2      2021-04-17 [2] CRAN (R 4.1.0)                   
#>  codetools          0.2-18     2020-11-04 [2] CRAN (R 4.1.0)                   
#>  colorspace         2.0-2      2021-06-24 [1] CRAN (R 4.1.0)                   
#>  cowplot          * 1.1.1      2020-12-30 [1] CRAN (R 4.0.4)                   
#>  crayon             1.4.1      2021-02-08 [1] CRAN (R 4.0.4)                   
#>  curl               4.3.1      2021-04-30 [1] CRAN (R 4.1.0)                   
#>  data.table         1.14.0     2021-02-21 [1] CRAN (R 4.0.4)                   
#>  DBI                1.1.1      2021-01-15 [1] CRAN (R 4.0.4)                   
#>  dbplyr             2.1.1      2021-04-06 [1] CRAN (R 4.0.5)                   
#>  digest             0.6.27     2020-10-24 [1] CRAN (R 4.0.3)                   
#>  distill            1.2        2021-01-13 [1] CRAN (R 4.1.0)                   
#>  downlit            0.2.1      2020-11-04 [1] CRAN (R 4.0.3)                   
#>  dplyr            * 1.0.7      2021-06-18 [1] CRAN (R 4.1.0)                   
#>  ellipsis           0.3.2      2021-04-29 [1] CRAN (R 4.1.0)                   
#>  evaluate           0.14       2019-05-28 [1] CRAN (R 4.0.0)                   
#>  fansi              0.5.0      2021-05-25 [1] CRAN (R 4.1.0)                   
#>  farver             2.1.0      2021-02-28 [1] CRAN (R 4.0.4)                   
#>  forcats          * 0.5.1      2021-01-27 [1] CRAN (R 4.0.4)                   
#>  foreach            1.5.1      2020-10-15 [1] CRAN (R 4.0.3)                   
#>  fs                 1.5.0      2020-07-31 [1] CRAN (R 4.0.2)                   
#>  gargle             1.1.0      2021-04-02 [1] CRAN (R 4.0.5)                   
#>  generics           0.1.0      2020-10-31 [1] CRAN (R 4.0.3)                   
#>  GenomeInfoDb       1.28.0     2021-05-19 [1] Bioconductor                     
#>  GenomeInfoDbData   1.2.6      2021-05-31 [1] Bioconductor                     
#>  ggbeeswarm       * 0.6.0      2017-08-07 [1] CRAN (R 4.0.0)                   
#>  ggforce            0.3.3      2021-03-05 [1] CRAN (R 4.0.4)                   
#>  ggplot2          * 3.3.5      2021-06-25 [1] CRAN (R 4.1.0)                   
#>  glue               1.4.2      2020-08-27 [1] CRAN (R 4.0.2)                   
#>  googledrive        1.0.1      2020-05-05 [1] CRAN (R 4.0.0)                   
#>  googlesheets4      0.3.0      2021-03-04 [1] CRAN (R 4.1.0)                   
#>  gtable             0.3.0      2019-03-25 [1] CRAN (R 4.0.0)                   
#>  haven              2.4.1      2021-04-23 [1] CRAN (R 4.1.0)                   
#>  here             * 1.0.1      2020-12-13 [1] CRAN (R 4.0.5)                   
#>  highr              0.9        2021-04-16 [1] CRAN (R 4.1.0)                   
#>  hms                1.1.0      2021-05-17 [1] CRAN (R 4.1.0)                   
#>  htmltools          0.5.1.1    2021-01-22 [1] CRAN (R 4.0.3)                   
#>  httr               1.4.2      2020-07-20 [1] CRAN (R 4.0.2)                   
#>  igraph             1.2.6      2020-10-06 [1] CRAN (R 4.0.3)                   
#>  import             1.2.0      2020-09-24 [1] CRAN (R 4.0.2)                   
#>  IRanges            2.26.0     2021-05-19 [1] Bioconductor                     
#>  iterators          1.0.13     2020-10-15 [1] CRAN (R 4.0.3)                   
#>  jquerylib          0.1.4      2021-04-26 [1] CRAN (R 4.1.0)                   
#>  jsonlite           1.7.2      2020-12-09 [1] CRAN (R 4.0.3)                   
#>  kableExtra       * 1.3.4      2021-02-20 [1] CRAN (R 4.0.4)                   
#>  knitr              1.33       2021-04-24 [1] CRAN (R 4.1.0)                   
#>  labeling           0.4.2      2020-10-20 [1] CRAN (R 4.0.3)                   
#>  lattice            0.20-44    2021-05-02 [2] CRAN (R 4.1.0)                   
#>  lifecycle          1.0.0      2021-02-15 [1] CRAN (R 4.0.4)                   
#>  lubridate          1.7.10     2021-02-26 [1] CRAN (R 4.0.4)                   
#>  magrittr           2.0.1      2020-11-17 [1] CRAN (R 4.0.3)                   
#>  MASS               7.3-54     2021-05-03 [2] CRAN (R 4.1.0)                   
#>  Matrix             1.3-3      2021-05-04 [2] CRAN (R 4.1.0)                   
#>  metacal          * 0.2.0.9001 2021-06-15 [1] Github (mikemc/metacal@a7a87a1)  
#>  mgcv               1.8-35     2021-04-18 [2] CRAN (R 4.1.0)                   
#>  modelr             0.1.8      2020-05-19 [1] CRAN (R 4.0.0)                   
#>  multtest           2.48.0     2021-05-19 [1] Bioconductor                     
#>  munsell            0.5.0      2018-06-12 [1] CRAN (R 4.0.0)                   
#>  nlme               3.1-152    2021-02-04 [2] CRAN (R 4.1.0)                   
#>  nvimcom          * 0.9-102    2021-06-15 [1] local                            
#>  openssl            1.4.4      2021-04-30 [1] CRAN (R 4.1.0)                   
#>  patchwork        * 1.1.1      2020-12-17 [1] CRAN (R 4.0.3)                   
#>  permute            0.9-5      2019-03-12 [1] CRAN (R 4.0.0)                   
#>  phyloseq         * 1.36.0     2021-05-19 [1] Bioconductor                     
#>  pillar             1.6.1      2021-05-16 [1] CRAN (R 4.1.0)                   
#>  pkgconfig          2.0.3      2019-09-22 [1] CRAN (R 4.0.0)                   
#>  plyr               1.8.6      2020-03-03 [1] CRAN (R 4.0.0)                   
#>  polyclip           1.10-0     2019-03-14 [1] CRAN (R 4.0.0)                   
#>  prettyunits        1.1.1      2020-01-24 [1] CRAN (R 4.0.0)                   
#>  progress           1.2.2      2019-05-16 [1] CRAN (R 4.0.2)                   
#>  ps                 1.6.0      2021-02-28 [1] CRAN (R 4.0.4)                   
#>  purrr            * 0.3.4      2020-04-17 [1] CRAN (R 4.0.0)                   
#>  R6                 2.5.0      2020-10-28 [1] CRAN (R 4.0.3)                   
#>  rappdirs           0.3.3      2021-01-31 [1] CRAN (R 4.0.4)                   
#>  RColorBrewer       1.1-2      2014-12-07 [1] CRAN (R 4.0.0)                   
#>  Rcpp               1.0.6      2021-01-15 [1] CRAN (R 4.0.3)                   
#>  RCurl              1.98-1.3   2021-03-16 [1] CRAN (R 4.0.5)                   
#>  readr            * 1.4.0      2020-10-05 [1] CRAN (R 4.0.3)                   
#>  readxl             1.3.1      2019-03-13 [1] CRAN (R 4.0.0)                   
#>  reprex             2.0.0      2021-04-02 [1] CRAN (R 4.0.5)                   
#>  reshape2           1.4.4      2020-04-09 [1] CRAN (R 4.0.0)                   
#>  rhdf5              2.36.0     2021-05-19 [1] Bioconductor                     
#>  rhdf5filters       1.4.0      2021-05-19 [1] Bioconductor                     
#>  Rhdf5lib           1.14.0     2021-05-19 [1] Bioconductor                     
#>  rlang              0.4.11     2021-04-30 [1] CRAN (R 4.1.0)                   
#>  rmarkdown        * 2.8        2021-05-07 [1] CRAN (R 4.1.0)                   
#>  rprojroot          2.0.2      2020-11-15 [1] CRAN (R 4.0.3)                   
#>  rstudioapi         0.13       2020-11-12 [1] CRAN (R 4.0.3)                   
#>  rvest              1.0.0      2021-03-09 [1] CRAN (R 4.0.5)                   
#>  S4Vectors          0.30.0     2021-05-19 [1] Bioconductor                     
#>  sass               0.4.0      2021-05-12 [1] CRAN (R 4.1.0)                   
#>  scales             1.1.1      2020-05-11 [1] CRAN (R 4.0.0)                   
#>  sessioninfo        1.1.1      2018-11-05 [1] CRAN (R 4.0.0)                   
#>  speedyseq        * 0.5.3.9018 2021-06-29 [1] Github (mikemc/speedyseq@ceb941f)
#>  stringi            1.6.2      2021-05-17 [1] CRAN (R 4.1.0)                   
#>  stringr          * 1.4.0      2019-02-10 [1] CRAN (R 4.0.0)                   
#>  survival           3.2-11     2021-04-26 [2] CRAN (R 4.1.0)                   
#>  svglite            2.0.0      2021-02-20 [1] CRAN (R 4.0.4)                   
#>  systemfonts        1.0.2      2021-05-11 [1] CRAN (R 4.1.0)                   
#>  tibble           * 3.1.2      2021-05-16 [1] CRAN (R 4.1.0)                   
#>  tidyr            * 1.1.3      2021-03-03 [1] CRAN (R 4.0.4)                   
#>  tidyselect         1.1.1      2021-04-30 [1] CRAN (R 4.1.0)                   
#>  tidyverse        * 1.3.1      2021-04-15 [1] CRAN (R 4.1.0)                   
#>  tweenr             1.0.2      2021-03-23 [1] CRAN (R 4.0.5)                   
#>  useful             1.2.6      2018-10-08 [1] CRAN (R 4.0.0)                   
#>  utf8               1.2.1      2021-03-12 [1] CRAN (R 4.0.5)                   
#>  vctrs              0.3.8      2021-04-29 [1] CRAN (R 4.1.0)                   
#>  vegan              2.5-7      2020-11-28 [1] CRAN (R 4.0.3)                   
#>  vipor              0.4.5      2017-03-22 [1] CRAN (R 4.0.0)                   
#>  viridisLite        0.4.0      2021-04-13 [1] CRAN (R 4.0.5)                   
#>  webshot            0.5.2      2019-11-22 [1] CRAN (R 4.0.2)                   
#>  withr              2.4.2      2021-04-18 [1] CRAN (R 4.0.5)                   
#>  xfun               0.23       2021-05-15 [1] CRAN (R 4.1.0)                   
#>  xml2               1.3.2      2020-04-23 [1] CRAN (R 4.0.0)                   
#>  XVector            0.32.0     2021-05-19 [1] Bioconductor                     
#>  yaml               2.2.1      2020-02-01 [1] CRAN (R 4.0.0)                   
#>  zlibbioc           1.38.0     2021-05-19 [1] Bioconductor                     
#> 
#> [1] /home/michael/R/x86_64-pc-linux-gnu-library/4.0
#> [2] /usr/lib/R/library